# Audio-Visual Fusion
Videollama2.1 7B 16F Base
Apache-2.0
VideoLLaMA2.1 is an upgraded version of VideoLLaMA 2 that improves spatio-temporal modeling and audio understanding in large video-language models.
Video-to-Text
Transformers English

DAMO-NLP-SG
179
1
Videollama2 72B
Apache-2.0
VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatio-temporal modeling. It supports video and image inputs and can perform visual question answering and dialogue tasks.
Video-to-Text
Transformers English

DAMO-NLP-SG
26
10
Videollama2 8x7B
Apache-2.0
VideoLLaMA 2 is a multimodal large language model focused on video understanding and audio processing. It accepts video and image inputs and generates natural-language responses.
Video-to-Text
Transformers English

DAMO-NLP-SG
21
3
Videollama2 7B 16F Base
Apache-2.0
VideoLLaMA 2 is a multimodal large language model focused on enhancing spatio-temporal modeling and audio understanding in video comprehension.
Video-to-Text
Transformers English

DAMO-NLP-SG
64
2